
    In support of workload-aware streaming state management

    Modern distributed stream processors predominantly rely on LSM-based key-value stores to manage the state of long-running computations. We question the suitability of such general-purpose stores for streaming workloads and argue that they incur unnecessary overheads in exchange for state management capabilities. Since streaming operators are instantiated once and are long-running, state types, sizes, and access patterns can either be inferred at compile time or learned during execution. This paper surfaces the limitations of established practices for streaming state management and advocates for configurable streaming backends, tailored to the state requirements of each operator. Using workload-aware state management, we achieve an order of magnitude improvement in p99 latency and 2x higher throughput.
    https://www.usenix.org/system/files/hotstorage20_paper_kalavri.pdf
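    The core idea above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: each operator declares its state access pattern, and the runtime picks a matching backend instead of a one-size-fits-all LSM store (the class and function names here are hypothetical).

```python
class HashBackend:
    """Fast point reads/writes, e.g. for keyed aggregations."""
    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key):
        return self._data.get(key)


class SortedBackend:
    """Supports ordered range scans, e.g. for window operators."""
    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def range(self, lo, hi):
        # Scan keys in [lo, hi) in sorted order.
        return [(k, self._data[k]) for k in sorted(self._data) if lo <= k < hi]


def backend_for(access_pattern):
    # Hypothetical selection hook: in a workload-aware runtime this choice
    # could be made per operator at compile time or adapted during execution.
    return SortedBackend() if access_pattern == "range" else HashBackend()
```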

    A Survey on the Evolution of Stream Processing Systems

    Stream processing has been an active research field for more than 20 years, but it is now witnessing its prime time due to recent successful efforts by the research community and numerous worldwide open-source communities. This survey provides a comprehensive overview of fundamental aspects of stream processing systems and their evolution in the functional areas of out-of-order data management, state management, fault tolerance, high availability, load management, elasticity, and reconfiguration. We review noteworthy past research findings, outline the similarities and differences between early ('00-'10) and modern ('11-'18) streaming systems, and discuss recent trends and open problems.
    Comment: 34 pages, 15 figures, 5 tables

    TVA: A multi-party computation system for secure and expressive time series analytics

    We present TVA, a multi-party computation (MPC) system for secure analytics on secret-shared time series data. TVA achieves strong security guarantees in the semi-honest and malicious settings, and high expressivity by enabling complex analytics on inputs with unordered and irregular timestamps. TVA is the first system to support arbitrary composition of oblivious window operators, keyed aggregations, and multiple filter predicates, while keeping all data attributes private, including record timestamps and user-defined values in query predicates. At the core of the TVA system lie novel protocols for secure window assignment: (i) a tumbling window protocol that groups records into fixed-length time buckets and (ii) two session window protocols that identify periods of activity followed by periods of inactivity. We also contribute a new protocol for secure division with a public divisor, which may be of independent interest. We evaluate TVA on real LAN and WAN environments and show that it can efficiently compute complex window-based analytics on inputs of 2^{22} records with modest use of resources. When compared to the state-of-the-art, TVA achieves up to 5.8x lower latency in queries with multiple filters and two orders of magnitude better performance in window aggregation.
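    The two window types named above can be illustrated in plaintext. The sketch below shows only the groupings that TVA's protocols compute obliviously over secret shares; it is not the secure protocols themselves, and the function names are our own.

```python
def tumbling_window(records, width):
    """Group (timestamp, value) records into fixed-length time buckets."""
    buckets = {}
    for ts, val in records:
        # Bucket index is the timestamp divided by the window width.
        buckets.setdefault(ts // width, []).append(val)
    return buckets


def session_windows(timestamps, gap):
    """Split sorted timestamps into sessions: a new session starts when the
    inactivity gap between consecutive records exceeds `gap`."""
    sessions, current = [], [timestamps[0]]
    for prev, ts in zip(timestamps, timestamps[1:]):
        if ts - prev > gap:
            sessions.append(current)
            current = []
        current.append(ts)
    sessions.append(current)
    return sessions
```

    Note that both groupings leak nothing in TVA because timestamps stay secret-shared; the plaintext versions here only convey the window semantics.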

    The Future is Big Graphs! A Community View on Graph Processing Systems

    Graphs are by nature unifying abstractions that can leverage interconnectedness to represent, explore, predict, and explain real- and digital-world phenomena. Although real users and consumers of graph instances and graph workloads understand these abstractions, future problems will require new abstractions and systems. What needs to happen in the next decade for big graph processing to continue to succeed?
    Comment: 12 pages, 3 figures, collaboration between the large-scale systems and data management communities, work started at the Dagstuhl Seminar 19491 on Big Graph Processing Systems, to be published in the Communications of the ACM

    Performance Optimization Techniques and Tools for Data-Intensive Computation Platforms : An Overview of Performance Limitations in Big Data Systems and Proposed Optimizations

    Big data processing has recently gained a lot of attention both from academia and industry. The term refers to tools, methods, techniques and frameworks built to collect, store, process and analyze massive amounts of data. Big data can be structured, unstructured or semi-structured. Data is generated from various different sources and can arrive in the system at various rates. In order to process these large amounts of heterogeneous data in an inexpensive and efficient way, massive parallelism is often used. The common architecture of a big data processing system consists of a shared-nothing cluster of commodity machines. However, even in such a highly parallel setting, processing is often very time-consuming. Applications may take up to hours or even days to produce useful results, making interactive analysis and debugging cumbersome.
    One of the main problems is that good performance requires both good data locality and good resource utilization. A characteristic of big data analytics is that the amount of data that is processed is typically large in comparison with the amount of computation done on it. In this case, processing can benefit from data locality, which can be achieved by moving the computation close to the data, rather than vice versa. Good utilization of resources means that the data processing is done with maximal parallelization. Both locality and resource utilization are aspects of the programming framework's runtime system. Requiring the programmer to work explicitly with parallel process creation and process placement is not desirable. Thus, specifying good optimizations that relieve the programmer from low-level, error-prone instrumentation is essential to achieving good performance.
    The main goal of this thesis is to study, design and implement performance optimizations for big data frameworks. This work contributes methods and techniques to build tools for easy and efficient processing of very large data sets. It describes ways to make systems faster by shortening job completion times. Another major goal is to facilitate application development in distributed data-intensive computation platforms and make big-data analytics accessible to non-experts, so that users with limited programming experience can benefit from analyzing enormous datasets.
    The thesis provides results from a study of existing optimizations in MapReduce and Hadoop-related systems. The study presents a comparison and classification of existing systems based on their main contribution. It then summarizes the current state of the research field and identifies trends and open issues, while also providing our vision on future directions. Next, this thesis presents a set of performance optimization techniques and corresponding tools for data-intensive computing platforms: PonIC, a project that ports the high-level dataflow framework Pig on top of the data-parallel computing framework Stratosphere. The results of this work show that Pig can highly benefit from using Stratosphere as the backend system and gain performance, without any loss of expressiveness. The work also identifies the features of Pig that negatively impact execution time and presents a way of integrating Pig with different backends. HOP-S, a system that uses in-memory random sampling to return approximate, yet accurate query answers. It uses a simple, yet efficient random sampling technique implementation, which significantly improves the accuracy of online aggregation. An optimization that exploits computation redundancy in analysis programs, and m2r2, a system that stores intermediate results and uses plan matching and rewriting in order to reuse results in future queries. Our prototype on top of the Pig framework demonstrates significantly reduced query response times. Finally, an optimization framework for iterative fixed points, which exploits asymmetry in large-scale graph analysis. The framework uses a mathematical model to explain several optimizations and to formally specify the conditions under which optimized iterative algorithms are equivalent to the general solution.
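    The result-reuse idea behind m2r2 can be sketched briefly. This is a minimal illustration under our own assumptions, not the thesis prototype: results are memoized under a canonical signature of the query plan, so a future query with an equivalent plan skips execution (all names here are hypothetical).

```python
import hashlib


class ResultCache:
    """Memoize query results keyed by a canonical plan signature."""

    def __init__(self):
        self._store = {}
        self.hits = 0

    def _signature(self, plan_steps):
        # Canonicalize the plan so equivalent plans share one cache key.
        # Real plan matching would normalize operator order and parameters.
        return hashlib.sha256("|".join(plan_steps).encode()).hexdigest()

    def run(self, plan_steps, execute):
        key = self._signature(plan_steps)
        if key in self._store:
            self.hits += 1          # reuse a stored intermediate result
            return self._store[key]
        result = execute()          # cache miss: actually run the plan
        self._store[key] = result
        return result
```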

    Integrating Pig and Stratosphere

    MapReduce is a wide-spread programming model for processing big amounts of data in parallel. PACT is a generalization of MapReduce, based on the concept of Parallelization Contracts (PACTs). Writing efficient applications in MapReduce or PACT requires strong programming skills and in-depth understanding of the systems' architectures. Several high-level languages have been developed in order to make the power of these systems accessible to non-experts, save development time and make application code easier to understand and maintain. One of the most popular high-level dataflow systems is Apache Pig. Pig overcomes Hadoop's one-input and two-stage dataflow limitations, allowing the developer to write SQL-like scripts. However, Hadoop's limitations are still present in the backend system and add a notable overhead to the execution time. Pig is currently implemented on top of Hadoop; however, it has been designed to be modular and independent of the execution engine. In this thesis project, we propose the integration of Pig with another framework for parallel data processing, Stratosphere. We show that Stratosphere has desirable properties that significantly improve Pig's performance. We present an algorithm that translates Pig Latin scripts into PACT programs that can be executed on the Nephele execution engine. We also present a prototype system that we have developed and we provide measurements on a set of basic Pig scripts and their native MapReduce and PACT implementations. We show that the Pig-Stratosphere integration is very promising and can lead to Pig scripts executing even more efficiently than native MapReduce applications.
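    The translation step can be illustrated with a toy mapping. This is our own sketch, not the thesis's algorithm: each Pig Latin statement is mapped to a data-parallel operator in the target plan, using the second-order contract names that PACT defines (Map, Reduce, Match); the dictionary and function names are ours.

```python
# Toy mapping from Pig Latin operators to PACT input contracts.
PIG_TO_PACT = {
    "LOAD": "DataSource",
    "FILTER": "Map",      # per-record predicate
    "FOREACH": "Map",     # per-record projection
    "GROUP": "Reduce",    # keyed grouping
    "JOIN": "Match",      # two-input equi-join contract
    "STORE": "DataSink",
}


def translate(pig_script):
    """Turn a list of Pig Latin statements into a linear plan of PACT
    operators; a real translator would build an operator DAG instead."""
    plan = []
    for stmt in pig_script:
        op = stmt.split()[0].upper()
        plan.append(PIG_TO_PACT.get(op, "Map"))
    return plan
```

    A real Pig-to-PACT translator must also handle multi-input statements and nested expressions, which a per-statement lookup like this cannot express.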